Learning to Discover Regulatory Elements for Gene Expression Prediction

Su, Xingyu, Yu, Haiyang, Zhi, Degui, Ji, Shuiwang

arXiv.org Artificial Intelligence

We consider the problem of predicting gene expression from DNA sequences. A key challenge of this task is to find the regulatory elements that control gene expression. Here, we introduce Seq2Exp, a Sequence-to-Expression network explicitly designed to discover and extract the regulatory elements that drive target gene expression, enhancing the accuracy of gene expression prediction. Our approach captures the causal relationship among epigenomic signals, DNA sequences, and their associated regulatory elements. Specifically, we propose to decompose the epigenomic signals and the DNA sequence conditioned on the causally active regulatory elements, and apply an information bottleneck with the Beta distribution to combine their effects while filtering out non-causal components. Our experiments demonstrate that Seq2Exp outperforms existing baselines on gene expression prediction tasks and discovers more influential regions than commonly used statistical peak-detection methods such as MACS3. The source code is released as part of the AIRS library (https://github.com/divelab/AIRS/). Gene expression is a fundamental process that dictates cellular function and variability, providing insights into the mechanisms underlying development (Pratapa et al., 2020), disease (Cookson et al., 2009; Emilsson et al., 2008), and responses to external factors (Schubert et al., 2018). Despite its critical importance, predicting gene expression from genomic sequences remains challenging due to the complexity and variability of the regulatory elements involved. Recent advances in deep learning (Avsec et al., 2021; Gu & Dao, 2023; Nguyen et al., 2024; Badia-i Mompel et al., 2023) have shown remarkable capabilities in modeling long sequential data such as language and DNA sequences.
By capturing intricate dependencies within genomic data, these techniques provide a deeper understanding of how regulatory elements contribute to transcription (Aristizabal et al., 2020). To predict gene expression, DNA language models are usually applied to encode long DNA sequences, with a subsequent predictor estimating the gene expression values (Avsec et al., 2021; Nguyen et al., 2024; Gu & Dao, 2023; Schiff et al., 2024).
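The abstract names the mechanism but not its implementation. As a rough illustration only, a per-position soft mask drawn from a Beta distribution could combine sequence and epigenomic evidence as in the toy sketch below; the variable names and the exact parameterization are hypothetical, not taken from Seq2Exp:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 8                                   # toy sequence length
seq_logits = rng.normal(size=L)         # stand-in for sequence-branch scores
epi_logits = rng.normal(size=L)         # stand-in for epigenomic-branch scores

# Combine both evidence sources into per-position Beta parameters:
# high combined scores push the sampled mask toward 1.
a = np.exp(seq_logits + epi_logits) + 1e-3
b = np.exp(-(seq_logits + epi_logits)) + 1e-3

mask = rng.beta(a, b)                   # soft mask in [0, 1], one value per position
features = rng.normal(size=(L, 4))      # toy per-position feature vectors
masked = mask[:, None] * features       # down-weights putatively non-causal positions

# Bottleneck-style penalty: the expected mask value is E[Beta(a, b)] = a / (a + b);
# adding this term to the loss pressures the mask to stay sparse.
ib_penalty = (a / (a + b)).mean()
```

In a trained model the two logit vectors would come from learned encoders, and the penalty would be traded off against prediction accuracy; here they are random placeholders.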


METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

Liu, Ollie, Jaghouar, Sami, Hagemann, Johannes, Wang, Shangshang, Wiemels, Jason, Kaufman, Jeff, Neiswanger, Willie

arXiv.org Artificial Intelligence

We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.
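Byte-pair encoding itself is a standard procedure; the toy version below shows the kind of merges such a tokenizer learns over nucleotide strings. It is a generic BPE sketch on a made-up three-sequence corpus, not METAGENE-1's actual tokenizer:

```python
from collections import Counter

def train_bpe(sequences, num_merges):
    """Learn byte-pair merges over nucleotide sequences (toy sketch)."""
    corpus = [list(s) for s in sequences]   # start from single-character tokens
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            for i in range(len(toks) - 1):
                pairs[(toks[i], toks[i + 1])] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        for toks in corpus:                 # replace the pair everywhere
            i, out = 0, []
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            toks[:] = out
    return merges

def tokenize(seq, merges):
    """Apply learned merges, in order, to a new sequence."""
    toks = list(seq)
    for a, b in merges:
        i, out = 0, []
        while i < len(toks):
            if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(toks[i])
                i += 1
        toks = out
    return toks

merges = train_bpe(["ACGTACGT", "ACGTTTAC", "TTACGTAA"], num_merges=3)
tokens = tokenize("ACGTAC", merges)   # → ["ACGT", "AC"]
```

On real metagenomic data the merge table would be far larger and learned over trillions of base pairs; the mechanics are the same.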


Finding Closure: A Closer Look at the Gestalt Law of Closure in Convolutional Neural Networks

Zhang, Yuyan, Soydaner, Derya, Koßmann, Lisa, Behrad, Fatemeh, Wagemans, Johan

arXiv.org Artificial Intelligence

The human brain has an inherent ability to fill in gaps to perceive figures as complete wholes, even when parts are missing or fragmented. This phenomenon is known as Closure in psychology, one of the Gestalt laws of perceptual organization, explaining how the human brain interprets visual stimuli. Given the importance of Closure for human object recognition, we investigate whether neural networks rely on a similar mechanism. Exploring this crucial human visual skill in neural networks has the potential to highlight their comparability to humans. Recent studies have examined the Closure effect in neural networks. However, they typically focus on a limited selection of Convolutional Neural Networks (CNNs) and have not reached a consensus on their capability to perform Closure. To address these gaps, we present a systematic framework for investigating the Closure principle in neural networks. We introduce well-curated datasets designed to test for Closure effects, including both modal and amodal completion. We then conduct experiments on various CNNs employing different measurements. Our comprehensive analysis reveals that VGG16 and DenseNet-121 exhibit the Closure effect, while other CNNs show variable results. We interpret these findings by blending insights from psychology and neural network research, offering a unique perspective that enhances transparency in understanding neural networks. Our code and dataset will be made available on GitHub.


α-HMM: A Graphical Model for RNA Folding

Zhang, Sixiang, Yang, Aaron J., Cai, Liming

arXiv.org Artificial Intelligence

The secondary structure of a ribonucleic acid (RNA) is a higher-order structure over the primary sequence of the molecule. Nucleotides on the sequence physically come close to each other through hydrogen bonds between bases, forming canonical Watson-Crick pairs (A-U and G-C) and the wobble pair (G-U) as the fundamental components of the structure. The secondary structure is an intermediate and, to a great extent, a scaffold for the higher-order interactions between nucleotides that generate RNA tertiary, i.e., 3-dimensional, structure [7, 17]. The latter determines important RNA functions in biological processes, in which RNA acts not only as a genetic information carrier but also plays catalytic, scaffolding, structural, and regulatory roles [12, 4]. There has been abundant interest in understanding the detailed process and dynamics of how RNA folds into its structure [3]. Computational prediction of RNA secondary structure directly from the primary sequence is a highly desirable step toward the prediction of RNA 3D structure. This is evident in RNA-Puzzles, an annual competition to predict RNA 3D structures, in which most of the methods used by participants are preceded by a secondary structure prediction phase [24, 22, 23]. These authors contributed equally to this work.
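The pairing rules described above are simple enough to state directly in code. The short sketch below checks a proposed secondary structure against the canonical and wobble pairs; the helper names and the toy hairpin are ours, not from the paper:

```python
# Canonical Watson-Crick pairs (A-U, G-C) and the wobble pair (G-U),
# the fundamental components of RNA secondary structure.
VALID_PAIRS = {("A", "U"), ("U", "A"),
               ("G", "C"), ("C", "G"),
               ("G", "U"), ("U", "G")}

def can_pair(x, y):
    """True if nucleotides x and y can form a canonical or wobble base pair."""
    return (x.upper(), y.upper()) in VALID_PAIRS

def check_structure(seq, pairs):
    """Verify that every (i, j) index pair in a proposed secondary
    structure joins pairable nucleotides of seq."""
    return all(can_pair(seq[i], seq[j]) for i, j in pairs)

# Toy hairpin: positions 0-6 (G-C) and 1-5 (A-U) pair across a loop.
ok = check_structure("GACAAUC", [(0, 6), (1, 5)])
```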


Rethinking Performance Measures of RNA Secondary Structure Problems

Runge, Frederic, Franke, Jörg K. H., Fertmann, Daniel, Hutter, Frank

arXiv.org Artificial Intelligence

Accurate RNA secondary structure prediction is vital for understanding cellular regulation and disease mechanisms. Deep learning (DL) methods have surpassed traditional algorithms by predicting complex features like pseudoknots and multi-interacting base pairs. However, traditional distance measures struggle with such tertiary interactions, and the currently used evaluation measures (F1 score, MCC) have limitations. We propose the Weisfeiler-Lehman graph kernel (WL) as an alternative metric. Embracing graph-based metrics like WL enables fair and accurate evaluation of RNA structure prediction algorithms. Further, WL provides informative guidance, as demonstrated in an RNA design experiment.
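The Weisfeiler-Lehman kernel is a standard graph kernel; the sketch below shows its label-refinement idea on two tiny graphs standing in for structure graphs. This is generic WL with made-up graphs, not the authors' exact formulation:

```python
from collections import Counter

def wl_histogram(adj, labels, iterations=2):
    """Label-count histogram after WL refinement (toy sketch).
    adj: {node: [neighbors]}, labels: {node: initial label}."""
    hist = Counter(labels.values())
    for _ in range(iterations):
        new = {}
        for v in adj:
            # Refine each label with the sorted multiset of neighbor labels.
            new[v] = labels[v] + "(" + ",".join(sorted(labels[u] for u in adj[v])) + ")"
        labels = new
        hist.update(labels.values())
    return hist

def wl_kernel(adj1, lab1, adj2, lab2, iterations=2):
    """Dot product of the two graphs' WL label histograms."""
    h1 = wl_histogram(adj1, lab1, iterations)
    h2 = wl_histogram(adj2, lab2, iterations)
    return sum(h1[k] * h2[k] for k in h1)

# Two tiny "structure graphs": a 3-node path vs. a 3-node cycle,
# all nodes identically labeled so only topology distinguishes them.
path = {0: [1], 1: [0, 2], 2: [1]}
cycle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
labels = {0: "N", 1: "N", 2: "N"}

same = wl_kernel(path, labels, path, labels)
diff = wl_kernel(path, labels, cycle, labels)   # same > diff: WL separates the topologies
```

The point of the abstract carries over: because the comparison is between graphs, tertiary contacts such as pseudoknots simply become extra edges rather than cases a distance measure cannot represent.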


Description Generation using Variational Auto-Encoders for precursor microRNA

Petković, Marko, Menkovski, Vlado

arXiv.org Artificial Intelligence

MicroRNAs (miRNAs) are a type of non-coding RNA involved in gene regulation, and they can be associated with diseases such as cancer as well as cardiovascular and neurological diseases. As such, identifying miRNAs across the entire genome can be of great relevance. Since experimental methods for novel precursor miRNA (pre-miRNA) detection are complex and expensive, computational detection using machine learning (ML) could be useful. Existing ML methods are often complex black boxes, which do not create an interpretable structural description of pre-miRNA. In this paper, we propose a novel framework that makes use of generative modeling through Variational Auto-Encoders (VAEs) to uncover the generative factors of pre-miRNA. After training the VAE, the pre-miRNA description is developed using a decision tree on the lower-dimensional latent space. Applying the framework to miRNA classification, we obtain high reconstruction and classification performance, while also developing an accurate miRNA description.


ViDa: Visualizing DNA hybridization trajectories with biophysics-informed deep graph embeddings

Zhang, Chenwei, Lovrod, Jordan, Beronov, Boyan, Duc, Khanh Dao, Condon, Anne

arXiv.org Artificial Intelligence

Visualization tools can help synthetic biologists and molecular programmers understand the complex reactive pathways of nucleic acid reactions, which can be designed for many potential applications and can be modelled using a continuous-time Markov chain (CTMC). Here we present ViDa, a new visualization approach for DNA reaction trajectories that uses a 2D embedding of the secondary structure state space underlying the CTMC model. To this end, we integrate a scattering transform of the secondary structure adjacency, a variational autoencoder, and a nonlinear dimensionality reduction method. We augment the training loss with domain-specific supervised terms that capture both thermodynamic and kinetic features. We assess ViDa on two well-studied DNA hybridization reactions. Our results demonstrate that the domain-specific features lead to significant quality improvements over the state-of-the-art in DNA state space visualization, successfully separating different folding pathways and thus providing useful insights into dominant reaction mechanisms.


Visualizing DNA reaction trajectories with deep graph embedding approaches

Zhang, Chenwei, Duc, Khanh Dao, Condon, Anne

arXiv.org Artificial Intelligence

Synthetic biologists and molecular programmers design novel nucleic acid reactions, with many potential applications. Good visualization tools are needed to help domain experts make sense of the complex outputs of folding pathway simulations of such reactions. Here we present ViDa, a new approach for visualizing DNA reaction folding trajectories over the energy landscape of secondary structures. We integrate a deep graph embedding model with common dimensionality reduction approaches, to map high-dimensional data onto 2D Euclidean space. We assess ViDa on two well-studied and contrasting DNA hybridization reactions. Our preliminary results suggest that ViDa's visualization successfully separates trajectories with different folding mechanisms, thereby providing useful insight to users, and is a big improvement over the current state-of-the-art in DNA kinetics visualization.


Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision

Malusare, Aditya, Kothandaraman, Harish, Tamboli, Dipesh, Lanman, Nadia A., Aggarwal, Vaneet

arXiv.org Artificial Intelligence

This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, which analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model on reference genome sequences and apply it to the following downstream tasks: (1) identification of enhancers, promoters, and splice sites; (2) identification of biological function annotations of genomic sequences; (3) recognition of sequences containing base-call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze at byte-level precision; and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. On each of these tasks, we demonstrate significant improvement over the existing state-of-the-art results.
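The byte-level advantage claimed for task (3) can be seen with a toy comparison: a single inserted base shifts the frame of non-overlapping k-mer tokenization, changing every downstream token, while byte-level tokens stay aligned around the error. This is a generic illustration on made-up reads, not ENBED's tokenizer:

```python
def kmer_tokens(seq, k=3):
    """Non-overlapping k-mer tokenization (a common multi-base scheme)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

ref = "ACGTGGACT"    # reference read
err = "ACGATGGACT"   # same read with a single base inserted after the third base

# Multi-base tokens: the insertion shifts the frame, so every token
# after the error changes.
ref_kmers = kmer_tokens(ref)   # ["ACG", "TGG", "ACT"]
err_kmers = kmer_tokens(err)   # ["ACG", "ATG", "GAC"]

# Byte-level tokens: the two reads still align one-for-one around
# the single inserted symbol.
byte_ref, byte_err = list(ref), list(err)
```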


Blind Biological Sequence Denoising with Self-Supervised Set Learning

Ng, Nathan, Park, Ji Won, Lee, Jae Hyeon, Kelly, Ryan Lewis, Ra, Stephen, Cho, Kyunghyun

arXiv.org Artificial Intelligence

Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the "average" of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL methods denoise small reads of $\leq 6$ subreads with 17% fewer errors and large reads of $>6$ subreads with 8% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set. By accurately denoising these reads, SSSL promises to better realize the potential of high-throughput DNA sequencing data for downstream scientific applications.
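The set-embedding midpoint idea can be illustrated with a toy encoder. Everything below is a hypothetical stand-in for SSSL's learned components: the "encoder" is a 2-mer count vector and the "decoder" just returns the subread nearest the midpoint, whereas SSSL decodes the embedding into a new sequence:

```python
import numpy as np

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def embed(subread):
    """Hypothetical stand-in for a learned encoder: a 2-mer count vector."""
    v = np.zeros(16)
    for i in range(len(subread) - 1):
        v[4 * CODE[subread[i]] + CODE[subread[i + 1]]] += 1.0
    return v

def set_embedding(subreads):
    """Midpoint of the subread embeddings: the 'average' of the set."""
    return np.stack([embed(s) for s in subreads]).mean(axis=0)

def decode_to_nearest(subreads):
    """Crude stand-in for decoding: return the subread whose embedding
    lies closest to the set midpoint."""
    z = set_embedding(subreads)
    dists = [np.linalg.norm(embed(s) - z) for s in subreads]
    return subreads[int(np.argmin(dists))]

# Three noisy observations (subreads) of the same underlying sequence.
subreads = ["ACGTTGCA", "ACGTAGCA", "ACCTTGCA"]
z = set_embedding(subreads)
consensus = decode_to_nearest(subreads)
```

Because each subread carries independent noise, the midpoint averages the errors out; the subread sharing the most structure with the others ends up nearest to it.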